ABSTRACT

Video surveillance cameras generate most recorded video,
and there is far more recorded video than operators can watch.
Much progress has recently been made in summarizing
recorded video, but such techniques have had little impact
on live video surveillance.

We assume a camera hierarchy where a Master camera
observes the decision-critical region, and one or more Slave
cameras observe regions where past activity is important for
making the current decision. We propose that when people
appear in the live Master camera, the Slave cameras display
their past activities, so that the operator can use this past
information for real-time decision making.

The basic units of our method are action tubes, representing
objects and their trajectories over time. Our object-based
method has advantages over frame-based methods, as it can
handle multiple people and multiple activities for each person,
and can address re-identification uncertainty.

Index Terms — Video Surveillance, Video Synopsis, 
Multi Camera Synopsis 

1. INTRODUCTION 

Surveillance cameras are installed everywhere, and are becoming
even more popular due to lower costs of cameras, networking,
and storage. The increase in the number of cameras
is not being offset by a proportional increase in the number of
operators available to monitor the video, and in practice most
surveillance video is not being viewed. The large gap between
the availability of human operators and the need to extract the
information in the recorded video has attracted much interest
from the computer vision community.

Surveillance video has two main purposes: real-time remote
sensing and forensic historical analysis. Historical
video is rarely used for real-time decision making. In this paper
we suggest a novel object-based method for using past
surveillance video for live decision making.

One of the great challenges for human operators is being
able to exploit relations between the video streams across
time and across cameras. Let us consider a library with several
cameras. Some cameras view the bookshelves while one
camera views the lending desk. Viewing each stream independently
may not reveal suspicious behavior. The librarian
cannot remember all the activity of all library visitors, which
occurs at different times in different cameras. However, if all
the bookshelf cameras delayed showing the activities of each
reader until he reaches the lending desk, the librarian could
easily grasp all of the reader's activity before he leaves the library.

The camera synchronization paradigm is quite general.
Other cases where cross-camera relationships are important are:

• Effective Business intelligence: What items did customers 
look at before purchasing? 

• Anomalous behavior detection: Does a traveler change his 
pace before going through customs? 

• Checkpoint: A guard at the exit from a secure facility can 
observe if visitors behaved suspiciously during their visit 
before being allowed to leave. 

The relations in these cases are all object-based, and they
benefit from comparing the behavior of people across different
locations and times. All such systems are hierarchical,
where one camera is viewed in real-time (Master) while other
cameras are of forensic significance (Slaves). The Master
camera need not be static, and could even be a body-mounted
camera worn by the operator. In this case the Master video
need not be viewed, as its view is the same as the operator's.

In standard camera networks the analyst needs to remember
all objects during a few hours of video, which is unreasonable.
However, in a hierarchical camera system, if we display
in the Slave videos the previous actions of all persons currently
observed in the Master camera, the operator will need
to remember only a few seconds of video from the Slave cameras
for understanding the activity in the Master camera. This
motivates Live Video Synopsis (LVS). In LVS, activity tubes
are initially extracted from all Slave video streams, and persons
are identified and labeled. Tubes are then shifted in time
in the Slave videos to be displayed only when the person is
observed in the Master view.

Notably, we shift Slave tubes in time but do not attempt
to bring tubes from different cameras onto the same screen or
to shift tubes spatially, as object tubes might then be placed on
semantically unrelated backgrounds, sometimes with absurd results
(e.g. people floating in mid-air). Changes in geometry
between cameras can also cause some tubes to look unnatural and
out of place (e.g. front and side views).

Live Video Synopsis has the following benefits: 1) The
relations between persons observed at multiple cameras are
clearly visible to the operator. 2) Multiple persons and histories
can be observed on the same screen. 3) In cases of
re-identification uncertainty, multiple possibilities can be displayed.
4) The information can aid live decision making.

Fig. 1: Video Synopsis: (a-c) Original frames, (d) Synopsis
frame. Objects from different times appear simultaneously.

2. RELATED WORK 

Much work has been done on understanding surveillance
video. Popular approaches include the classification of activity
as normal or anomalous [?], and the use of activity recognition
to transcribe surveillance video into words [?]. High-level
activity understanding is a very promising research direction,
but current performance leaves room for improvement. Because
the need for human inspection of video will remain
for some time, many methods create visual summaries for
faster viewing.

One approach for visual summarization is the generation
of a storyboard by selecting some key frames [?]. Another
approach is adaptive fast forward [?], dropping frames at different
rates depending on how interesting the video is. Video
synopsis [?] shifts activities in time so that as many activities
as possible can be presented simultaneously, and thus presents
all activities of a video in a much shorter video. See Fig. 1.

Single-camera approaches for summarization do not generalize
well to multiple cameras, as they do not take into account
the relationship between the different cameras. Some
work addressed video captured by several overlapping cameras [?].
But this work cannot be used with most camera networks,
in which the cameras are mostly non-overlapping.

Representation of the video from non-overlapping cameras
has received little attention. A notable exception is [13],
which projects multiple video cameras on a 3D model of the
environment. But such a 3D model is not generally available.
Another interesting work is [14], which
recognized the importance of using objects for highlighting
relationships between video streams from multiple cameras.
That work, however, concentrated on the extraction and
indexing of objects rather than on visual representation.

A somewhat related approach is Multi-Video Browsing
and Summarization [?], which attempts to synchronize
video streams by shifting frames in time, so that visually
similar frames are observed in all videos at the same time.
This scheme measures similarity between frames using a set of
trained visual similarity descriptors, in contrast to our work,
which is object-based.


3. LIVE VIDEO SYNOPSIS (LVS) 


The generation of LVS consists of three stages: Preprocessing
(Sec. 3.1), Optimization (Sec. 3.2), and Display (Sec. 3.3).


3.1. Video Preprocessing 

Before selecting the Slave action tubes corresponding to
the persons observed by the Master camera, several
pre-processing steps are required:

1. People are detected and tracked in all Slave video streams.
Each person is represented as a space-time "tube", which
is the union of all pixels of this object in each frame.
Relevant literature on object detection using background
subtraction appears in [6], [12], and on tracking objects across
frames in [10, 14]. The extraction of video tubes
is depicted in Fig. 2.

2. People are detected in the current frame of the Master
stream. There has been much work on human detection
[?] and in particular on pedestrian detection [?].

3. Re-identification of people between the Master stream
detections and the tubes extracted from the Slave streams is
performed [20], [21], [22]. Re-identification scores between
two objects are often given probabilistically, e.g. [22].
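
A minimal sketch of the tube representation in step 1, assuming a detector and tracker (not shown) already supply the pixel set of one person in each frame. The class name `Tube` and its methods are illustrative and do not come from the paper:

```python
from dataclasses import dataclass, field


@dataclass
class Tube:
    """Space-time tube: the pixels of one tracked object in each frame."""
    tube_id: str
    # frame index t -> set of (x, y) pixels that belong to the object
    masks: dict = field(default_factory=dict)

    def add_frame(self, t, pixels):
        """Merge the object's pixels for frame t into the tube."""
        self.masks.setdefault(t, set()).update(pixels)

    def is_active(self, x, y, t):
        """Binary membership test: is pixel (x, y) in frame t part of the tube?"""
        return (x, y) in self.masks.get(t, set())

    def span(self):
        """First and last frame index covered by the tube."""
        return min(self.masks), max(self.masks)


# Hypothetical tracker output for one person over two frames.
tube = Tube("person-1")
tube.add_frame(10, {(4, 5), (4, 6)})
tube.add_frame(11, {(5, 5), (5, 6)})
```

Storing only the active pixels per frame keeps the later collision test a plain set intersection; a dense binary mask array would serve equally well.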


3.2. Slave Action Tube Selection 

In this section we assume a camera system consisting of one 
Master camera of real-time importance and one or more Slave 
cameras of forensic importance. We propose to detect people 
in the Master camera stream at fixed time intervals, and play 
for each Slave camera, the activity tubes from the past that 
contain the observed people. 

Pre-processing is done as described in Sec. 3.1. At fixed
intervals of length $\delta T$, the Slave action tubes to be displayed
in each Slave video $v$ are selected. The task is to select a set
of tubes $S_v$ to display in Slave video $v$ out of the total set of
tubes in the Slave view, $B_v$ ($S_v \subseteq B_v$). Three factors
are taken into account: i) displaying the maximal number
of Slave tubes containing the people observed in the Master
camera; ii) minimizing tube collisions; iii) a stable viewing
experience: minimizing the number of tube switches at each
interval.

Fig. 2: a) A video showing a single object. b) Tubes are binary
masks representing an object, containing all pixels in all
frames belonging to the object.

This can be formulated using two energy terms, a collision
term $E^c$ and an identical-object overlap term $E^o$:

$$E_v^t(S_v) = \alpha \cdot E_v^c(S_v) - E_v^o(S_v), \qquad S_v \subseteq B_v \tag{1}$$

We do not explicitly take into account the relations between
different Slave videos. The energy terms $E_v^t$ for Slave
videos $v$ are optimized independently of other Slave videos.

3.2.1. Collision Cost 

The objective of $E^c$ is to minimize collisions between action
tubes placed in the generated videos. A small number of collisions
can be tolerated, and doing so can greatly increase the number
of Slave activities displayed simultaneously. The number of
collisions that can be tolerated can be modified by adjusting
$\alpha$ in Eq. (1).

Let tube $b$ be defined by the binary function $\chi_b(x, y, t)$,
indicating whether the pixel $(x, y)$ in frame $t$ is active for tube $b$.

Given a Slave camera, the collision cost for its generated
Slave video $v$ is defined in Eq. (2) (similar to [10]): the number
of colliding pixels among all pairs of different tubes in the
video. We add a discount factor for collisions that are forecast
further in the future, as we become increasingly uncertain
that the tubes will not be terminated before the forecast
collision (due to new persons appearing and old persons
disappearing in the Master view). The amount of discounting is
determined by the factor $d$.

$$E_v^c(S_v) = \sum_{b, b' \in S_v,\; b \neq b'} \; \sum_{x, y, t} \chi_b(x, y, t) \cdot \chi_{b'}(x, y, t) \cdot d^t \tag{2}$$

where $S_v$ is the set of tubes chosen for display in the output
Slave video $v$ at the current time interval from the total set $B_v$.
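
The collision term can be sketched as follows. This is an illustrative reading of Eq. (2), not the authors' code: each tube is assumed to be a dict mapping a frame offset `t` (counted from the current interval) to a set of active `(x, y)` pixels, each unordered pair of tubes is counted once, and the discount is applied as `d ** t`. The function name `collision_cost` is hypothetical.

```python
def collision_cost(tubes, d=0.978):
    """Discounted count of pixels shared by any two different tubes.

    tubes: list of dicts, each mapping frame offset t -> set of (x, y) pixels.
    d:     discount factor; collisions further in the future weigh less.
    """
    cost = 0.0
    for i in range(len(tubes)):
        for j in range(i + 1, len(tubes)):  # each unordered pair once (assumption)
            a, b = tubes[i], tubes[j]
            for t in a.keys() & b.keys():   # frames where both tubes are playing
                cost += len(a[t] & b[t]) * d ** t
    return cost
```

Summing over ordered pairs instead would simply double the value, which can be absorbed into the weight $\alpha$.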


3.2.2. Identity Cost 

The person identity cost in Eq. (3) encapsulates several requirements:
i) displaying the Slave tubes having the highest probability
of correspondence to the people detected in the Master
stream (this set is labeled $O$); ii) making the number of tubes
corresponding to each object in the Master frame roughly
equal; iii) encouraging retention of already playing tubes for
smoother viewing. This can be formulated as:

$$E_v^o(S_v) = \sum_{o \in O} \sqrt{\sum_{b \in S_v} \left(1 + \beta \cdot \mathbb{1}_{b \in S_v^{t-1}}\right) \cdot P_{b,o}} \tag{3}$$

where $S_v^{t-1}$ is the set of Slave tubes selected in the last
interval, and $\beta$ is a constant determining the strength of the
preference to retain old tubes. The square root encourages
the display of all objects in roughly equal numbers; otherwise
most tubes might come from the single most likely object. When
the Slave action tube and the person appearing in the Master
camera are different persons, this term has little effect, as the
probability $P_{b,o}$ that tube $b$ corresponds to object $o$ will be low.
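
A short sketch of the identity term shows how the square root spreads the display across people. It assumes re-identification probabilities are supplied as a dict keyed by `(tube_id, object_id)`; the function name `identity_cost` and the sample ids are hypothetical.

```python
import math


def identity_cost(selected, previous, prob, objects, beta=0.5):
    """Identity term: sum over Master objects of the square root of the
    (retention-weighted) match probabilities of the selected tubes.

    selected: iterable of tube ids chosen for display
    previous: set of tube ids displayed in the last interval
    prob:     dict (tube_id, object_id) -> re-identification probability
    objects:  ids of the people detected in the Master frame
    """
    total = 0.0
    for o in objects:
        inner = 0.0
        for b in selected:
            retain = 1.0 + (beta if b in previous else 0.0)
            inner += retain * prob.get((b, o), 0.0)
        total += math.sqrt(inner)
    return total


# Two candidate tubes for person "p", one for person "q", all equally likely.
prob = {("a1", "p"): 0.5, ("a2", "p"): 0.5, ("b1", "q"): 0.5}
same_person = identity_cost(["a1", "a2"], set(), prob, ["p", "q"])
both_people = identity_cost(["a1", "b1"], set(), prob, ["p", "q"])
```

Here `both_people` exceeds `same_person`, so showing one tube per person is preferred over two tubes of the same person, which is the balancing effect described above.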

3.2.3. Cost Minimization 

The energy for each Slave camera as expressed in Eq. (1) can
be minimized using standard discrete optimization methods.
However, the fast greedy approach described below generated
good results as well.

• For all Slave videos $v$:

  • Set $S_v = \emptyset$

  • Set list $L = B_v$ (all tubes for video $v$)

  • Until no tubes are left in $L$:

    1. For each tube $b \in L$, calculate the approximate decrease
       in energy from the overlap term $E_v^o$ that adding $b$ to
       $S_v$ would give.

    2. Select the tube $b$ with the largest decrease.

    3. If the sum of discounted collisions between $b$ and the
       tubes in $S_v$ is smaller than the threshold $\tau$, i.e. if
       $\sum_{b' \in S_v} \sum_{x, y, t} \chi_b(x, y, t) \cdot \chi_{b'}(x, y, t) \cdot d^t < \tau$,
       add $b$ to $S_v$.

    4. Remove $b$ from $L$.

We apply this update rule every $\delta T$ seconds and display
the tubes $\{b \mid b \in S_v\}$ in Slave view $v$.
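
The greedy steps above can be sketched for a single Slave video as follows. This is an illustration under stated assumptions rather than the authors' implementation: tubes are dicts from frame offset to pixel sets, the gain in step 1 is computed exactly from the identity term instead of with the paper's (unspecified) approximation, and the names `pair_collision` and `greedy_select` are hypothetical. The default parameter values are those reported in the experiments.

```python
import math


def pair_collision(a, b, d):
    """Discounted number of pixels shared by two tubes (dict: t -> set of (x, y))."""
    return sum(len(a[t] & b[t]) * d ** t for t in a.keys() & b.keys())


def greedy_select(tubes, prob, objects, previous, beta=0.5, tau=15.0, d=0.978):
    """Greedily choose the tubes to display in one Slave video.

    tubes:    dict tube_id -> {t: set of (x, y)}, t counted from the current interval
    prob:     dict (tube_id, object_id) -> re-identification probability
    objects:  ids of the people currently detected in the Master frame
    previous: set of tube ids selected in the last interval
    """
    selected = []
    inner = {o: 0.0 for o in objects}  # running sum under each square root
    remaining = set(tubes)

    def weight(b):
        return 1.0 + (beta if b in previous else 0.0)

    def gain(b):
        # Exact increase of the identity term if b were added (assumption:
        # the paper uses an approximation that is not reproduced here).
        return sum(
            math.sqrt(inner[o] + weight(b) * prob.get((b, o), 0.0)) - math.sqrt(inner[o])
            for o in objects
        )

    while remaining:
        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        collisions = sum(pair_collision(tubes[best], tubes[s], d) for s in selected)
        if collisions < tau:  # step 3: accept only if collisions stay under tau
            selected.append(best)
            for o in objects:
                inner[o] += weight(best) * prob.get((best, o), 0.0)
        remaining.remove(best)  # step 4: each tube is considered once
    return selected
```

Each tube is examined once per interval, so the cost is dominated by the pairwise collision checks against the tubes already selected.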

3.3. Synopsis Display of Slave Cameras 

LVS is now generated for each Slave camera $v$ by placing every
tube from $S_v$ with the correct temporal offset (in case it
has already been playing) over the stationary background of
the corresponding Slave video. We emphasize that a synopsis
video is created for each Slave camera. Tubes are not transferred
between cameras, nor shifted in space. This ensures
that all objects remain on their original backgrounds and geometries,
creating videos that are easy to understand.
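
The temporal offset can be illustrated with a small helper that maps the live time to the recorded frame of a tube. It is a sketch that assumes integer frame indices and equal frame rates for recording and display; the function name `source_frame` and its parameters are hypothetical.

```python
def source_frame(tube_start, tube_end, display_start, now):
    """Recorded frame of a tube to paste at live frame `now`.

    tube_start, tube_end: first and last recorded frame of the tube
    display_start:        live frame at which the tube began playing
    Returns None when the tube is not playing at `now`.
    """
    offset = now - display_start
    frame = tube_start + offset
    if offset < 0 or frame > tube_end:
        return None
    return frame
```

A tube recorded over frames 100 to 150 that started playing at live frame 1000 therefore shows recorded frame 110 at live frame 1010, and a tube retained across intervals continues from where it left off rather than restarting.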













Fig. 3: A sample frame from the output of the MS algorithm
on a Store scene. The left image is the Master camera
and the right image is the Slave. Tubes corresponding
to the two persons in the Master view were rendered
simultaneously in the Slave videos. The clip can be seen at:
http://www.vision.huji.ac.il/syncvid/



Fig. 4: A sample frame from the output of the Master-Slave
algorithm run on a Stadium scene. The top video is
the Master camera and the bottom two are Slaves. Multiple
tubes corresponding to the person in the Master view
are displayed in the Slave videos. Many matching tubes
were found and are displayed simultaneously. This cannot
be achieved by frame-based methods. The clip can be seen
at: http://www.vision.huji.ac.il/syncvid/


4. EXPERIMENTS 
We present frames from two scenes, demonstrating the output
of LVS. The Store scene was recorded by two non-overlapping
cameras in a store (Fig. 3), and the Stadium scene was
recorded by three non-overlapping cameras around a stadium
(Fig. 4). Tubes were extracted by a state-of-the-art background
subtraction method such as [?]. Tubes were manually
re-identified between the Master and Slave videos. Our method
was then run using the following parameters: $\beta = 0.5$,
$\delta T = 1$ second, $\tau = 15$, $d = 0.978$. The output clips can be
seen at http://www.vision.huji.ac.il/syncvid/

Fig. 5 a) shows a comparison between a frame-based
method (showing the whole frames of the highest-ranking
Slave action tube) and our object-based method, LVS. The
frame-based method was able to display only 15-30% of the
relevant Slave tubes, whereas our method was able to display
65-85% of Slave tubes with minimal collisions. This
performance gap is expected to increase further when
re-identification uncertainty is significant.

The trade-off between collisions and the number of relevant
Slave tubes can be seen in Fig. 5 b) for the three Slave videos.
A very modest collision rate (2%) is required for displaying
65-85% of relevant tubes.

Fig. 5: a) Comparison of tube inclusion rates of LVS vs. the
frame-based method for the three Slave videos. Significant
improvements have been obtained. b) The collision rate vs.
the tube inclusion rate for the three Slave videos. 65-85% of
tubes can be included for a modest collision cost.

Several benefits of LVS are apparent:

1) While concentrating on the Master camera, we are able
to see much of the history of the objects in the Slave cameras.
This can be of great utility for letting operators make decisions in
real-time.

2) In many cases the Master stream is sparse and contains
only a small number of objects, so it is possible to display several
candidate tubes for each object in the Slave nodes. This
is helpful because in many scenarios the top re-identification
result has about 30% recall probability, but the top 5 candidates
have an accumulated recall probability of above 65% [22].
Showing as many candidates as possible therefore increases
the likelihood of seeing the whole history of the object across
the scene.


5. CONCLUDING REMARKS 

Live video synopsis is a novel object-based method for using
summarization of previously recorded video for aiding live
decision making. It was shown that our method has many
advantages over frame-based methods. Although in this paper
we have concentrated on people, this method is general
and can be used for any type of object that can be detected
and re-identified across cameras (animals, cars, etc.). As our
method relies on having a reliable object re-identification
algorithm, improvements in person re-identification from video
will increase the reliability of our method. More interestingly,
our method can be used to display object re-identification
examples for active learning algorithms. This can be used for
obtaining interactive feedback from the operator for refining
video re-identification performance.